library("here")
library("dplyr")
library("ggplot2")
library("ade4")
library("kableExtra")
library("SummarizedExperiment")
library("phyloseq")
library("tidyverse")
library("GGally")
library("pheatmap")
library("factoextra")
library("MASS")
library("lattice")
library("ggcorrplot")
library("ggfortify")
library("ggrepel")
library("pracma")
library("dslabs")
library("ggridges")
By measuring and analyzing several variables on the same subjects, we can identify valuable patterns, dependencies and associations between the different variables.
Such analyses are made easy in R, as the data are usually represented as matrices.
However, a major challenge in biodata analysis is that the data come in huge dimensions of variation, hence the need to balance dimensionality reduction against information loss.
This chapter focuses on Principal Component Analysis (PCA) as a dimension reduction method, using basic data as well as data from high-throughput experiments. Geometric explanations of PCA, together with visualizations that help interpret the output of PCA analyses, are presented.
View examples of matrices that come up in the study of biological data.
Perform dimension reduction to understand correlations between variables.
Preprocess, rescale and center the data before starting a multivariate analysis.
Build new variables, called principal components (PC), that are more useful than the original measurements.
See what is “under the hood” of PCA: the singular value decomposition of a matrix.
Visualize what this decomposition achieves and learn how to choose the number of principal components.
Run through a complete PCA analysis from start to finish.
Project factor covariates onto the PCA map to enable a more useful interpretation of the results.
Data used in this chapter are:
-Turtles: a matrix of three biometric measurements on painted turtles.
turtles = read.table(here("Book","data","PaintedTurtles.txt"), header = TRUE)
dim(turtles)
## [1] 48 4
turtles[1:4, ]
## sex length width height
## 1 f 98 81 38
## 2 f 103 84 38
## 3 f 103 86 42
## 4 f 105 86 40
load(here("Book","data","athletes.RData"))
athletes[1:3, ]
## m100 long weight highj m400 m110 disc pole javel m1500
## 1 11.25 7.43 15.48 2.27 48.90 15.13 49.28 4.7 61.32 268.95
## 2 10.87 7.45 14.97 1.97 47.71 14.46 44.36 5.1 61.76 273.02
## 3 11.18 7.44 14.20 1.97 48.29 14.81 43.66 5.2 64.16 263.20
load(here("Book","data","Msig3transp.RData"))
round(Msig3transp,2)[1:5, 1:6]
## X3968 X14831 X13492 X5108 X16348 X585
## HEA26_EFFE_1 -2.61 -1.19 -0.06 -0.15 0.52 -0.02
## HEA26_MEM_1 -2.26 -0.47 0.28 0.54 -0.37 0.11
## HEA26_NAI_1 -0.27 0.82 0.81 0.72 -0.90 0.75
## MEL36_EFFE_1 -2.24 -1.08 -0.24 -0.18 0.64 0.01
## MEL36_MEM_1 -2.68 -0.15 0.25 0.95 -0.20 0.17
data("GlobalPatterns", package = "phyloseq")
GPOTUs = as.matrix(t(phyloseq::otu_table(GlobalPatterns)))
GPOTUs[1:4, 6:13]
## OTU Table: [8 taxa and 4 samples]
## taxa are columns
## 246140 143239 244960 255340 144887 141782 215972 31759
## CL3 0 7 0 153 3 9 0 0
## CC1 0 1 0 194 5 35 3 1
## SV1 0 0 0 0 0 0 0 0
## M31Fcsw 0 0 0 0 0 0 0 0
library("SummarizedExperiment")
data("airway", package = "airway")
assay(airway)[1:3, 1:4]
## SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003 679 448 873 408
## ENSG00000000005 0 0 0 0
## ENSG00000000419 467 515 621 365
metab = t(as.matrix(read.csv(here("Book","data","metabolites.csv"), row.names = 1)))
metab[1:4, 1:4]
## 146.0985388 148.7053275 310.1505057 132.4512963
## KOGCHUM1 29932.36 17055.70 1132.82 785.5129
## KOGCHUM2 94067.61 74631.69 28240.85 5232.0499
## KOGCHUM3 146411.33 147788.71 64950.49 10283.0037
## WTGCHUM1 229912.57 384932.56 220730.39 26115.2007
Tabulate the frequencies of zeros in the airway, GPOTUs and metab data matrices.
| Data | Number of zeros |
|---|---|
| airway | 314674 |
| GPOTUs | 395038 |
| metab | 604 |
These frequencies reflect the sparsity that exists in the datasets.
► Question 7.1
What are the columns of these data matrices usually called?
The columns are the variables.
In each of these examples, what are the rows of the matrix?
The rows are the subjects, samples or objects being studied, except in the RNA-Seq (airway) example, where the roles are swapped: rows are genes and the samples are in the columns.
What does a cell in a matrix represent ?
A record or measurement of a variable for a specific subject
If the data matrix is called athletes and you want to see the value of the third variable for the fifth athlete, what do you type into R?
athletes[5, 3]
Analyzing just one column (one-dimensional) is a univariate analysis. A one-number summary (mean or median) of such a column is a zero-dimensional summary.
A correlation coefficient is an example of a measure used when considering two variables measured together on a set of observations.
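As a quick illustration (on made-up toy vectors), the Pearson correlation can be computed directly from its definition and checked against R's built-in cor():

```r
# toy vectors, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
# Pearson correlation from its definition: covariance over the product of sds
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
all.equal(r_manual, cor(x, y))   # agrees with the built-in
```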
► Question 7.2
Compute the matrix of all correlations between the measurements from the turtles data. What do you notice?
cor(turtles[, -1])
## length width height
## length 1.0000000 0.9783116 0.9646946
## width 0.9783116 1.0000000 0.9605705
## height 0.9646946 0.9605705 1.0000000
► Question 7.3
Using GGally, produce all pairwise scatterplots, as well as the one-dimensional histograms on the diagonal, for the turtles data. What do you guess is the underlying, or “true”, dimension of these data?
ggpairs(turtles[, -1], axisLabels = "none")
All variables are strongly positively correlated, suggesting they mostly reflect a single underlying quantity (the overall size of the turtle), so the “true” dimension is close to one.
► Question 7.4
Make a pairs plot of the athletes data. What do you notice?
ggpairs(athletes, axisLabels = "none")
The ggpairs plot reveals subtle correlations, which can be viewed more easily with a pheatmap.
pheatmap(cor(athletes), cell.width = 10, cell.height = 10)
The heatmap clusters the variables into three groups: running, throwing and jumping.
Preprocessing the data is essential, especially when variables with different baselines and scales are involved.
Data transformation is therefore relevant.
Examples of transformations are:
Centering: subtracting the mean, so that the mean of the centered data is at the origin.
Scaling or standardizing: dividing by the standard deviation, so that the new standard deviation is \(1\). Unlike the VST, the aim is to make the scale (as measured by mean and standard deviation) of different variables the same.
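A minimal sketch (on a made-up matrix) of what scale() does under the hood: subtract each column mean, then divide by each column standard deviation:

```r
m <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)      # toy data
centered <- sweep(m, 2, colMeans(m), "-")                 # centering
standardized <- sweep(centered, 2, apply(m, 2, sd), "/")  # scaling
all.equal(standardized, scale(m), check.attributes = FALSE)
```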
► Question 7.5
Compute the means and standard deviations of the turtle data, then use the scale function to center and standardize the continuous variables. Call this scaledTurtles, then verify the new values for mean and standard deviation of scaledTurtles. Make a scatterplot of the scaled and centered width and height variables of the turtle data and color the points by their sex.
# find the standard deviations
apply(turtles[,-1], 2, sd)
## length width height
## 20.481602 12.675838 8.392837
# find the means
apply(turtles[,-1], 2, mean)
## length width height
## 124.68750 95.43750 46.33333
#scale and centre the data then compute mean and sd
scaledTurtles = scale(turtles[, -1])
apply(scaledTurtles, 2, mean)
## length width height
## -1.432050e-18 1.940383e-17 -2.870967e-16
apply(scaledTurtles, 2, sd)
## length width height
## 1 1 1
# convert the scaled data into a data frame and plot
data.frame(scaledTurtles, sex = turtles[, 1]) %>%
ggplot(aes(x = width, y = height, group = sex)) +
geom_point(aes(color = sex)) + coord_fixed()
Dimension reduction (DR) was invented by Karl Pearson to reduce a two-variable scatterplot to a single coordinate.
It was subsequently used by statisticians in the 1930s to summarize a battery of psychological tests run on the same subjects.
PCA is a widely used exploratory technique for DR via multivariate analysis.
It is an example of a geometrical projection of points from higher-dimensional spaces onto lower dimensions by a vector \(v\).
PCA is called an unsupervised learning technique because, as in clustering, it treats all variables as having the same status.
The goal is to find a linear mathematical model for an underlying structure for all the variables.
As an example, let us generate a scatterplot of two variables (weight and disc) of the athletes data, showing the projection onto the horizontal \(x\)-axis (defined by \(y = 0\)).
athletes = data.frame(scale(athletes))
ath_gg = ggplot(athletes, aes(x = weight, y = disc)) +
geom_point(size = 2, shape = 21)
ath_gg + geom_point(aes(y = 0), colour = "red") +
geom_segment(aes(xend = weight, yend = 0), linetype = "dashed")
► Task
Calculate the variance of the red points in Figure 7.6.
Make a plot showing projection lines onto the \(y-axis\) and projected points.
Compute the variance of the points projected onto the vertical \(y-axis\).
#a. Variance of projected weights
var(data.frame(scale(athletes))$weight)
## [1] 1
#b. projection onto the y axis.
athletes = data.frame(scale(athletes))
ath_gg = ggplot(athletes, aes(x = weight, y = disc)) +
geom_point(size = 2, shape = 21)
ath_gg + geom_point(aes(x = 0), colour = "purple") +
geom_segment(aes(xend = 0, yend = disc), linetype = "dashed")
#Variance of projected discs
var(data.frame(scale(athletes))$disc)
## [1] 1
-Summarizing/projecting data from two dimensions onto a line can easily lose valuable information about the variables, hence the need for effective approaches.
-Regression lines provide one family of such projections.
For instance, regress the disc variable on weight. This can be done with the lm function.
reg1 = lm(disc ~ weight, data = athletes)
a1 = reg1$coefficients[1] # intercept
b1 = reg1$coefficients[2] # slope
pline1 = ath_gg + geom_abline(intercept = a1, slope = b1,
col = "blue", lwd = 1.5)
pline1 + geom_segment(aes(xend = weight, yend = reg1$fitted),
colour = "red", arrow = arrow(length = unit(0.15, "cm")))
Conversely, a regression of weight on discus is obtained by
reg2 = lm(weight ~ disc, data = athletes)
a2 = reg2$coefficients[1] # intercept
b2 = reg2$coefficients[2] # slope
pline2 = ath_gg + geom_abline(intercept = -a2/b2, slope = 1/b2,
col = "darkgreen", lwd = 1.5)
pline2 + geom_segment(aes(xend=reg2$fitted, yend=disc),
colour = "orange", arrow = arrow(length = unit(0.15, "cm")))
The relationship (i.e the slope and intercept) differs depending on which of the variables we choose to be the predictor and which the response.
► Question 7.6
How large is the variance of the projected points that lie on the blue regression line of Figure 7.7? Compare this to the variance of the data when projected on the original axes, weight and disc.
# variances of the points along the original
var(athletes$disc)
## [1] 1
var(athletes$weight)
## [1] 1
# variance of lm fitted data
var(reg2$fitted)
## [1] 0.6502039
var(reg1$fitted)
## [1] 0.6502039
A line that minimizes distances in both directions
The goal is to minimize the sum of squares of the orthogonal (perpendicular) projections of data points onto the line.
This line is called the principal component line
Notice that the svd function is used here. The principal component line can be obtained by:
xy = cbind(athletes$disc, athletes$weight)
svda = svd(xy)
pc = xy %*% svda$v[, 1] %*% t(svda$v[, 1])
bp = svda$v[2, 1] / svda$v[1, 1]
ap = mean(pc[, 2]) - bp * mean(pc[, 1])
ath_gg + geom_segment(xend = pc[, 1], yend = pc[, 2]) +
geom_abline(intercept = ap, slope = bp, col = "purple", lwd = 1.5)
The blue line minimizes the sum of squares of the vertical residuals, the green line minimizes the horizontal residuals, the purple line, called the principal component, minimizes the orthogonal projections. Notice the ordering of the slopes of the three lines.
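The slope ordering can be checked on synthetic standardized data (an illustrative sketch, not the athletes data): for positively correlated standardized variables, the y-on-x regression slope is the correlation \(r\), the principal component slope is 1, and the x-on-y line has slope \(1/r\), so \(r < 1 < 1/r\):

```r
set.seed(1)                                 # synthetic example data
x <- as.numeric(scale(rnorm(100)))
y <- as.numeric(scale(0.7 * x + rnorm(100, sd = 0.5)))
r <- cor(x, y)
slope_yx <- unname(coef(lm(y ~ x))[2])      # y-on-x regression: slope r
v <- svd(cbind(x, y))$v[, 1]
slope_pc <- v[2] / v[1]                     # principal component line: slope 1
c(slope_yx, slope_pc, 1 / r)                # increasing: r < 1 < 1/r
```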
► Question 7.7
What is particular about the slope of the purple line? Redo the plots on the original (unscaled) variables. What happens?
The purple line has slope 1: because the variables were standardized, the first principal component lies along the diagonal.
#Disc onto Weight
load(here("Book","data","athletes.RData"))
d_w = ggplot(athletes, aes(x = weight, y = disc)) +
geom_point(size = 2, shape = 21)
d_w + geom_point(aes(y = 0), colour = "red") +
geom_segment(aes(xend = weight, yend = 0), linetype = "dashed")
#Weight onto disc
w_d = ggplot(athletes, aes(x = weight, y = disc)) +
geom_point(size = 2, shape = 21)
w_d + geom_point(aes(x = 0), colour = "red") +
geom_segment(aes(yend = disc, xend = 0), linetype = "dashed")
► Question 7.8
Compute the variance of the points on the purple line.
apply(pc, 2, var)
## [1] 0.9031761 0.9031761
sum(apply(pc, 2, var))
## [1] 1.806352
Principal components are linear combinations of the variables that were originally measured; they provide a new coordinate system.
The result is a new variable, and the coefficients of the linear combination are called the loadings.
The principal component line minimizes the distances of the points to the line, and equivalently maximizes the variance of the projections along the line.
PCA is based on the principle of finding the axis showing the largest inertia/variability, removing the variability in that direction, and then iterating to find the next best orthogonal axis, and so on.
All the axes can be found in one operation called the Singular Value Decomposition.
The workflow:
the means and variances are computed, and the choice of whether to work with rescaled covariances (the correlation matrix) or not has to be made;
the choice of \(k\), the number of components relevant to the data, is made; that is, the rank of the approximation we choose.
The end results of the PCA workflow are useful maps of both the variables and the samples.
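The choice of \(k\) is usually guided by the proportion of variance each component explains, which follows directly from the squared singular values. A hedged sketch on synthetic data:

```r
set.seed(4)
M <- scale(matrix(rnorm(100), ncol = 5))   # synthetic standardized data
d <- svd(M)$d                              # singular values, decreasing order
varexp <- d^2 / sum(d^2)                   # variance explained per component
cumsum(varexp)                             # pick k where the cumulative sum flattens
```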
t1 <- matrix(c('X',1:4,2,' ',' ',' ',' ',4,' ',' ',' ',' ',8,' ',' ',' ',' '), ncol = 4)
t2 <- t1; t2[2:5,2] <- as.numeric(t2[2:5,1]) * 2
t3 <- t2; t3[2:5,3] <- as.numeric(t3[2:5,2]) * 2
t4 <- t3; t4[2:5,4] <- as.numeric(t4[2:5,3]) * 2
tt1 <- knitr::kable(t1, 'html', table.attr = 'class = "console"', align = 'lrrr', col.names = c(' ', ' ',' ',' ')) %>%
column_spec(1, border_right = T) %>%
row_spec(1, extra_css = 'border-bottom:1px solid black;', bold=FALSE) %>%
row_spec(0, extra_css = 'border-bottom: 0;', bold=FALSE) %>%
add_header_above(header = c('Step 0' = 4))
tt2 <- knitr::kable(t2, 'html', table.attr = 'class = "console"', align = 'lrrr', col.names = c(' ', ' ',' ',' ')) %>%
column_spec(1, border_right = T) %>%
row_spec(1, extra_css = 'border-bottom:1px solid black;') %>%
row_spec(0, extra_css = 'border-bottom: 0;', bold=FALSE) %>%
add_header_above(header = c('Step 1' = 4))
tt3 <- knitr::kable(t3, 'html', table.attr = 'class = "console"', align = 'lrrr', col.names = c(' ', ' ',' ',' ')) %>%
column_spec(1, border_right = T) %>%
row_spec(1, extra_css = 'border-bottom:1px solid black;') %>%
row_spec(0, extra_css = 'border-bottom: 0;', bold=FALSE) %>%
add_header_above(header = c('Step 2' = 4))
tt4 <- knitr::kable(t4, 'html', table.attr = 'class = "console"', align = 'lrrr', col.names = c(' ', ' ',' ',' ')) %>%
column_spec(1, border_right = T) %>%
row_spec(1, extra_css = 'border-bottom:1px solid black;') %>%
row_spec(0, extra_css = 'border-bottom: 0;', bold=FALSE) %>%
add_header_above(header = c('Step 3' = 4))
knitr::kable(data.frame(as.character(tt1), as.character(tt2), as.character(tt3), as.character(tt4)),
col.names = NULL,
format = 'html',
escape = FALSE,
table.attr = 'class = "kable_wrapper"')
\(X\) has 12 elements, while in terms of \(u\) and \(v\) it can be expressed with only 7 numbers. Thus writing \(X = u v^t\) is more economical than spelling out the full matrix \(X\).
In reality our goal is rather to reverse the process: given \(X\), find the decomposed components.
Since the decomposition is not unique (there are several candidate choices for the vectors \(u\) and \(v\)), we choose the decomposition whose vectors are orthonormal.
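The non-uniqueness is easy to demonstrate on a toy rank-one matrix: any scalar can be shifted between \(u\) and \(v\) without changing the product, which is why the unit-norm (orthonormal) convention is adopted:

```r
u <- c(1, 2, 3)                      # toy vectors for illustration
v <- c(4, 5)
X <- u %*% t(v)                      # a rank-one matrix
# shifting a scalar between u and v leaves the product unchanged
all.equal(X, (2 * u) %*% t(v / 2))
# unit-norm convention: pull the norms out into a single singular value
s <- sqrt(sum(u^2)) * sqrt(sum(v^2))
all.equal(X, s * (u / sqrt(sum(u^2))) %*% t(v / sqrt(sum(v^2))))
```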
par(mfrow=c(1,2))
knitr::include_graphics(c(here("Book","image_files","images","SVD-mosaicXplot1.png"),here("Book","image_files","images","SVD-mosaicXplot2.png"),here("Book","image_files","images","SVD-mosaicXplot3.png")))
X = matrix(c(780, 75, 540,
936, 90, 648,
1300, 125, 900,
728, 70, 504), nrow = 3)
u = c(0.8196, 0.0788, 0.5674)
v = c(0.4053, 0.4863, 0.6754, 0.3782)
s1 = 2348.2
sum(u^2)
## [1] 0.9998964
sum(v^2)
## [1] 0.9999562
s1 * u %*% t(v)
## [,1] [,2] [,3] [,4]
## [1,] 780.03419 935.92555 1299.8645 727.87794
## [2,] 74.99597 89.98406 124.9748 69.98143
## [3,] 540.00903 647.93089 899.8818 503.90183
X - s1 * u %*% t(v)
## [,1] [,2] [,3] [,4]
## [1,] -0.034187016 0.07445066 0.13548011 0.12205890
## [2,] 0.004033752 0.01594279 0.02522674 0.01856789
## [3,] -0.009026004 0.06911092 0.11819353 0.09816522
► Question
Try svd(X) in R. Look at the components of the output of the svd function carefully. Check the norm of the columns of the matrices that result from this call. Where did the above value of s1 = 2348.2 come from?
svd(X)$u[, 1]
## [1] 0.81963482 0.07881104 0.56743949
svd(X)$v[, 1]
## [1] 0.4052574 0.4863089 0.6754290 0.3782403
sum(svd(X)$u[, 1]^2)
## [1] 1
sum(svd(X)$v[, 1]^2)
## [1] 1
svd(X)$d
## [1] 2.348244e+03 2.141733e-13 6.912584e-15
We see that the second and third singular values are 0 (up to the numeric precision we care about). That is why we say that X is of rank 1.
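As a cross-check (a sketch on a toy outer product), the numerical rank can also be obtained with qr():

```r
u <- c(0.82, 0.08, 0.57)           # toy vectors for illustration
v <- c(0.41, 0.49, 0.68, 0.38)
X1 <- u %*% t(v)                   # outer product: rank one by construction
qr(X1)$rank                        # numerical rank
```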
7.6.2 How do we find such a decomposition in a unique way?
In the example above, the decomposition has three elements: the horizontal and vertical singular vectors, and the diagonal corner, called the singular value.
These can be found using the singular value decomposition function, svd.
Xtwo = matrix(c(12.5, 35.0, 25.0, 25, 9, 14, 26, 18, 16, 21, 49, 32,
18, 28, 52, 36, 18, 10.5, 64.5, 36), ncol = 4, byrow = TRUE)
USV = svd(Xtwo)
► Question
Check how each successive pair of singular vectors improves our approximation to Xtwo. What do you notice about the third and fourth singular values?
Xtwo - USV$d[1] * USV$u[, 1] %*% t(USV$v[, 1])
## [,1] [,2] [,3] [,4]
## [1,] 0.87481760 19.045230 -10.1044650 1.74963521
## [2,] 0.08079747 1.759002 -0.9332405 0.16159494
## [3,] -0.04700978 -1.023427 0.5429803 -0.09401956
## [4,] 0.16159494 3.518005 -1.8664809 0.32318987
## [5,] -0.69632883 -15.159437 8.0428540 -1.39265765
Xtwo - USV$d[1] * USV$u[, 1] %*% t(USV$v[, 1]) -
USV$d[2] * USV$u[, 2] %*% t(USV$v[, 2])
## [,1] [,2] [,3] [,4]
## [1,] 7.216450e-15 -1.065814e-14 8.881784e-15 4.884981e-15
## [2,] 2.040035e-15 -5.995204e-15 1.054712e-14 3.219647e-15
## [3,] 2.865763e-15 -9.547918e-15 1.554312e-15 6.231127e-15
## [4,] 4.385381e-15 -5.773160e-15 1.776357e-14 7.049916e-15
## [5,] 5.107026e-15 -1.776357e-15 1.776357e-14 1.776357e-14
The third and fourth singular values are so small that they do not improve the approximation; we can conclude that Xtwo is of rank 2.
► Task
Check the orthonormality by computing the cross product of the \(U\) and \(V\) matrices:
t(USV$u) %*% USV$u
## [,1] [,2] [,3] [,4]
## [1,] 1.000000e+00 -1.665335e-16 0.000000e+00 -8.326673e-17
## [2,] -1.665335e-16 1.000000e+00 1.665335e-16 -5.551115e-17
## [3,] 0.000000e+00 1.665335e-16 1.000000e+00 -5.551115e-17
## [4,] -8.326673e-17 -5.551115e-17 -5.551115e-17 1.000000e+00
t(USV$v) %*% USV$v
## [,1] [,2] [,3] [,4]
## [1,] 1.000000e+00 8.326673e-17 1.387779e-17 -5.551115e-17
## [2,] 8.326673e-17 1.000000e+00 -3.642919e-17 -6.938894e-17
## [3,] 1.387779e-17 -3.642919e-17 1.000000e+00 2.775558e-17
## [4,] -5.551115e-17 -6.938894e-17 2.775558e-17 1.000000e+00
Execute svd on the rescaled turtles matrix:
turtles.svd = svd(scaledTurtles)
turtles.svd$d
## [1] 11.746475 1.419035 1.003329
dim(turtles.svd$u)
## [1] 48 3
► Question
What can you conclude about the turtles matrix from the svd output?
The coefficients of the first singular vector are approximately equal across the three variables, and the first singular value is much larger than the other two: the scaled turtles data are close to a rank-one matrix.
The Singular Value Decomposition is
\(X = USV^t, \quad V^tV = I, \quad U^tU = I\)
where \(S\) is the diagonal matrix of singular values, \(V^t\) is the transpose of \(V\), and \(I\) is the identity matrix.
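A minimal sketch verifying this identity on a small made-up matrix: reconstruct \(X\) from \(U\), \(S\) and \(V\), and check orthonormality:

```r
A <- matrix(c(2, 0, 1, 3, 1, 4), nrow = 2)   # toy 2 x 3 matrix
s <- svd(A)
recon <- s$u %*% diag(s$d) %*% t(s$v)        # X = U S V^t
all.equal(A, recon)
all.equal(crossprod(s$u), diag(2))           # U^t U = I
all.equal(crossprod(s$v), diag(2))           # V^t V = I
```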
The principal component transformation is defined so that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each successive component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components
► Question
Compute the first principal component for the turtles data by multiplying the first singular value turtles.svd$d[1] by turtles.svd$u[, 1]. What is another way of computing it?
turtles.svd$d[1] %*% turtles.svd$u[,1]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] -1.983668 -1.705579 -1.340216 -1.420782 -0.9423811 0.04689258 -0.09048395
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0.71721 0.8540019 0.8540019 0.5854404 0.8016959 0.7846503 0.858507
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## [1,] 1.353953 1.93447 1.808307 1.989887 2.890979 2.776548 2.907215 3.140809
## [,23] [,24] [,25] [,26] [,27] [,28] [,29]
## [1,] 3.362087 4.562009 -2.512689 -2.439124 -2.291411 -1.693556 -1.688242
## [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## [1,] -1.910913 -1.654375 -1.597856 -1.683736 -1.086174 -1.103512 -1.166447
## [,37] [,38] [,39] [,40] [,41] [,42]
## [1,] -0.7219127 -0.8307375 -0.7851402 -0.6374268 -0.8600987 -0.403541
## [,43] [,44] [,45] [,46] [,47] [,48]
## [1,] -0.4211712 -0.1937018 -0.0003910952 -0.01772898 0.1355914 0.8187415
scaledTurtles %*% turtles.svd$v[,1]
## [,1]
## [1,] -1.9836684602
## [2,] -1.7055794726
## [3,] -1.3402164350
## [4,] -1.4207818191
## [5,] -0.9423811150
## [6,] 0.0468925757
## [7,] -0.0904839545
## [8,] 0.7172099610
## [9,] 0.8540018654
## [10,] 0.8540018654
## [11,] 0.5854403531
## [12,] 0.8016958980
## [13,] 0.7846503260
## [14,] 0.8585070441
## [15,] 1.3539533202
## [16,] 1.9344701590
## [17,] 1.8083074735
## [18,] 1.9898872486
## [19,] 2.8909792543
## [20,] 2.7765475313
## [21,] 2.9072153955
## [22,] 3.1408088253
## [23,] 3.3620866667
## [24,] 4.5620089799
## [25,] -2.5126887623
## [26,] -2.4391243571
## [27,] -2.2914109209
## [28,] -1.6935561972
## [29,] -1.6882415877
## [30,] -1.9109134857
## [31,] -1.6543752488
## [32,] -1.5978564155
## [33,] -1.6837364090
## [34,] -1.0861739982
## [35,] -1.1035118831
## [36,] -1.1664470694
## [37,] -0.7219127043
## [38,] -0.8307375049
## [39,] -0.7851402035
## [40,] -0.6374267672
## [41,] -0.8600986652
## [42,] -0.4035410247
## [43,] -0.4211712224
## [44,] -0.1937018329
## [45,] -0.0003910952
## [46,] -0.0177289800
## [47,] 0.1355913785
## [48,] 0.8187414700
► Question Looking at the athletes data, what part of the output of the svd function gives the first PC coefficients, also known as the PC loadings?
svda$v[,1]
## [1] -0.7071068 -0.7071068
If we rotate the (discus, weight) plane so that the purple line becomes the horizontal \(x\)-axis, we obtain what is known as the first principal plane.
ppdf = tibble(PC1n = -svda$u[, 1] * svda$d[1],
PC2n = svda$u[, 2] * svda$d[2])
ggplot(ppdf, aes(x = PC1n, y = PC2n)) + geom_point() + xlab("PC1 ")+
ylab("PC2") + geom_point(aes(x=PC1n,y=0),color="red") +
geom_segment(aes(xend = PC1n, yend = 0), color = "red") +
geom_hline(yintercept = 0, color = "purple", lwd=1.5, alpha=0.5) +
xlim(-3.5, 2.7) + ylim(-2,2) + coord_fixed()
segm = tibble(xmin = pmin(ppdf$PC1n, 0), xmax = pmax(ppdf$PC1n, 0), yp = seq(-1, -2, length = nrow(ppdf)), yo = ppdf$PC2n)
ggplot(ppdf, aes(x = PC1n, y = PC2n)) + geom_point() + ylab("PC2") + xlab("PC1") +
geom_hline(yintercept=0,color="purple",lwd=1.5,alpha=0.5) +
geom_point(aes(x=PC1n,y=0),color="red")+
xlim(-3.5, 2.7)+ylim(-2,2)+coord_fixed() +
geom_segment(aes(xend=PC1n,yend=0), color="red")+
geom_segment(data=segm,aes(x=xmin,xend=xmax,y=yo,yend=yo), color="blue",alpha=0.5)
# the mean sum of squares of the red segments corresponds to the square of the second singular value
svda$d[2]^2
## [1] 6.196729
# The variance of the red points is var(ppdf$PC1n), which is larger than the number calculated in a) by design of the first PC
# We take the ratio of the standard deviations explained by the points on the horizontal and vertical axes by computing:
sd(ppdf$PC1n)/sd(ppdf$PC2n)
## [1] 3.054182
svda$d[1]/svda$d[2]
## [1] 3.054182
► Task
Use prcomp to compute the PCA of the first two columns of the athletes data, look at the output. Compare to the singular value decomposition.
prcomp(athletes[,1:2])
## Standard deviations (1, .., p=2):
## [1] 0.3452872 0.1805680
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## m100 0.5541639 0.8324075
## long -0.8324075 0.5541639
svd(athletes[,1:2])
## $d
## [1] 76.269677 1.952953
##
## $u
## [,1] [,2]
## [1,] -0.1767451 -0.113423360
## [2,] -0.1726840 -0.226609836
## [3,] -0.1760415 -0.137000969
## [4,] -0.1694264 -0.265162962
## [5,] -0.1742018 -0.176703317
## [6,] -0.1741439 -0.354214297
## [7,] -0.1732940 0.031420171
## [8,] -0.1711520 0.038838147
## [9,] -0.1734554 -0.007063171
## [10,] -0.1754672 -0.054148604
## [11,] -0.1734581 -0.207350719
## [12,] -0.1753370 -0.093816061
## [13,] -0.1732155 -0.116244447
## [14,] -0.1734474 -0.159046280
## [15,] -0.1744533 -0.182588996
## [16,] -0.1725101 -0.006297023
## [17,] -0.1742767 0.238011364
## [18,] -0.1772543 0.160313422
## [19,] -0.1720072 0.005474335
## [20,] -0.1702281 -0.057004582
## [21,] -0.1792376 -0.008908758
## [22,] -0.1765106 0.129666020
## [23,] -0.1757169 0.073490749
## [24,] -0.1740573 0.098983727
## [25,] -0.1725717 -0.095836107
## [26,] -0.1734028 0.167696506
## [27,] -0.1719162 0.039639187
## [28,] -0.1766613 0.139487116
## [29,] -0.1731219 0.118207953
## [30,] -0.1771143 0.102187888
## [31,] -0.1702112 0.458637467
## [32,] -0.1721329 0.378954371
## [33,] -0.1785929 0.078262097
##
## $v
## [,1] [,2]
## [1,] -0.8433808 0.5373163
## [2,] -0.5373163 -0.8433808
We can now get summary statistics for the one- and two-dimensional data. Next we answer the question about the “true” dimensionality of these rescaled data. Let’s consider the scaledTurtles data.
pcaturtles = princomp(scaledTurtles)
pcaturtles
## Call:
## princomp(x = scaledTurtles)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3
## 1.6954576 0.2048201 0.1448180
##
## 3 variables and 48 observations.
fviz_eig(pcaturtles, geom = "bar", bar_width = 0.4) + ggtitle("")
This screeplot shows the eigenvalues (variances) of the principal components.
Comparable PCA functions have been created by different teams who worked in different areas:
svd(scaledTurtles)$v[, 1]
## [1] 0.5787981 0.5779840 0.5752628
prcomp(turtles[, -1])$rotation[, 1]
## length width height
## 0.8068646 0.4947448 0.3227958
princomp(scaledTurtles)$loadings[, 1]
## length width height
## 0.5787981 0.5779840 0.5752628
dudi.pca(turtles[, -1], nf = 2, scannf = FALSE)$c1[, 1]
## [1] -0.5787981 -0.5779840 -0.5752628
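The agreement between these functions can be checked on synthetic data (a sketch; signs of the singular vectors may differ, which is why absolute values are compared):

```r
set.seed(2)
M <- scale(matrix(rnorm(40), ncol = 4))      # synthetic standardized data
# prcomp's loadings match svd's right singular vectors up to sign
all.equal(abs(unclass(prcomp(M)$rotation)), abs(svd(M)$v),
          check.attributes = FALSE)
# prcomp's sdev uses the n - 1 divisor
all.equal(prcomp(M)$sdev, svd(M)$d / sqrt(nrow(M) - 1))
```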
► Question
The sample coordinates from the princomp function (call the result res) are in the scores slot of the result. Take a look at PC1 for the turtles and compare it to res$scores. Compare the standard deviation sd1 to that in the res object and to the standard deviation of the scores.
res = princomp(scaledTurtles)
PC1 = scaledTurtles %*% res$loadings[,1]
sd1 = sqrt(mean(res$scores[, 1]^2))
► Question
Check the orthogonality of the res$scores matrix. Why can’t we say that it is orthonormal?
Combine both the PC scores (US) and the loadings/coefficients (V) to form a biplot (a plot where both samples and variables are represented).
fviz_pca_biplot(pcaturtles, label = "var", habillage = turtles[, 1]) +
ggtitle("")
► Question
Compare the variance of each new coordinate to the eigenvalues returned by the PCA dudi.pca function.
pcadudit = dudi.pca(scaledTurtles, nf = 2, scannf = FALSE)
apply(pcadudit$li, 2, function(x) sum(x^2)/48)
## Axis1 Axis2
## 2.93573765 0.04284387
pcadudit$eig
## [1] 2.93573765 0.04284387 0.02141848
The lengths of the arrows indicate the quality of the projections onto the first principal plane:
fviz_pca_var(pcaturtles, col.circle = "black") + ggtitle("") +
xlim(c(-1.2, 1.2)) + ylim(c(-1.2, 1.2))
► Question
Explain the relationships between the number of rows of our turtles data matrix and the following numbers:
svd(scaledTurtles)$d/pcaturtles$sdev
## Comp.1 Comp.2 Comp.3
## 6.928203 6.928203 6.928203
sqrt(47)
## [1] 6.855655
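The constant ratio comes from the variance divisor: princomp divides by \(n\) (here 48), so the singular values are \(\sqrt{n}\) times its standard deviations, while an \(n-1\) divisor (as in prcomp and sd, whence the guess sqrt(47)) would give \(\sqrt{n-1}\). A sketch on synthetic data of the same shape:

```r
set.seed(3)
M <- scale(matrix(rnorm(48 * 3), ncol = 3))  # synthetic 48 x 3 data
n <- nrow(M)
# princomp uses the divisor n, hence the constant ratio sqrt(n)
unname(svd(M)$d / princomp(M)$sdev)          # all equal to sqrt(48)
```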
7.7.2 A complete analysis: the decathlon athletes
cor(athletes) %>% round(1)
## m100 long weight highj m400 m110 disc pole javel m1500
## m100 1.0 -0.5 -0.2 -0.1 0.6 0.6 0.0 -0.4 -0.1 0.3
## long -0.5 1.0 0.1 0.3 -0.5 -0.5 0.0 0.3 0.2 -0.4
## weight -0.2 0.1 1.0 0.1 0.1 -0.3 0.8 0.5 0.6 0.3
## highj -0.1 0.3 0.1 1.0 -0.1 -0.3 0.1 0.2 0.1 -0.1
## m400 0.6 -0.5 0.1 -0.1 1.0 0.5 0.1 -0.3 0.1 0.6
## m110 0.6 -0.5 -0.3 -0.3 0.5 1.0 -0.1 -0.5 -0.1 0.1
## disc 0.0 0.0 0.8 0.1 0.1 -0.1 1.0 0.3 0.4 0.4
## pole -0.4 0.3 0.5 0.2 -0.3 -0.5 0.3 1.0 0.3 0.0
## javel -0.1 0.2 0.6 0.1 0.1 -0.1 0.4 0.3 1.0 0.1
## m1500 0.3 -0.4 0.3 -0.1 0.6 0.1 0.4 0.0 0.1 1.0
pca.ath = dudi.pca(athletes, scannf = FALSE)
pca.ath$eig
## [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
## [8] 0.3067981 0.2669494 0.1018542
fviz_eig(pca.ath, geom = "bar", bar_width = 0.3) + ggtitle("")
fviz_pca_var(pca.ath, col.circle = "black") + ggtitle("")
athletes[, c(1, 5, 6, 10)] = -athletes[, c(1, 5, 6, 10)]
cor(athletes) %>% round(1)
## m100 long weight highj m400 m110 disc pole javel m1500
## m100 1.0 0.5 0.2 0.1 0.6 0.6 0.0 0.4 0.1 0.3
## long 0.5 1.0 0.1 0.3 0.5 0.5 0.0 0.3 0.2 0.4
## weight 0.2 0.1 1.0 0.1 -0.1 0.3 0.8 0.5 0.6 -0.3
## highj 0.1 0.3 0.1 1.0 0.1 0.3 0.1 0.2 0.1 0.1
## m400 0.6 0.5 -0.1 0.1 1.0 0.5 -0.1 0.3 -0.1 0.6
## m110 0.6 0.5 0.3 0.3 0.5 1.0 0.1 0.5 0.1 0.1
## disc 0.0 0.0 0.8 0.1 -0.1 0.1 1.0 0.3 0.4 -0.4
## pole 0.4 0.3 0.5 0.2 0.3 0.5 0.3 1.0 0.3 0.0
## javel 0.1 0.2 0.6 0.1 -0.1 0.1 0.4 0.3 1.0 -0.1
## m1500 0.3 0.4 -0.3 0.1 0.6 0.1 -0.4 0.0 -0.1 1.0
pcan.ath = dudi.pca(athletes, nf = 2, scannf = FALSE)
pcan.ath$eig
## [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
## [8] 0.3067981 0.2669494 0.1018542
fviz_pca_var(pcan.ath, col.circle="black") + ggtitle("")
fviz_pca_ind(pcan.ath) + ggtitle("") + ylim(c(-2.5,5.7))
data("olympic", package = "ade4")
olympic$score
## [1] 8488 8399 8328 8306 8286 8272 8216 8189 8180 8167 8143 8114 8093 8083 8036
## [16] 8021 7869 7860 7859 7781 7753 7745 7743 7623 7579 7517 7505 7422 7310 7237
## [31] 7231 7016 6907
The screeplot of the variances of the new variables is used to choose the number of components to retain.
pcaMsig3 = dudi.pca(Msig3transp, center = TRUE, scale = TRUE,
scannf = FALSE, nf = 4)
fviz_screeplot(pcaMsig3) + ggtitle("")
ids = rownames(Msig3transp)
celltypes = factor(substr(ids, 7, 9))
status = factor(substr(ids, 1, 3))
table(celltypes)
## celltypes
## EFF MEM NAI
## 10 9 11
cbind(pcaMsig3$li, tibble(Cluster = celltypes, sample = ids)) %>%
ggplot(aes(x = Axis1, y = Axis2)) +
geom_point(aes(color = Cluster), size = 5) +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
scale_color_discrete(name = "Cluster") + coord_fixed()
load(here("Book", "data", "mat1xcms.RData"))
dim(mat1)
## [1] 399 12
pcamat1 = dudi.pca(t(mat1), scannf = FALSE, nf = 3)
fviz_eig(pcamat1, geom = "bar", bar_width = 0.7) + ggtitle("")
dfmat1 = cbind(pcamat1$li, tibble(
label = rownames(pcamat1$li),
number = substr(label, 3, 4),
type = factor(substr(label, 1, 2))))
pcsplot = ggplot(dfmat1,
aes(x=Axis1, y=Axis2, label=label, group=number, colour=type)) +
geom_text(size = 4, vjust = -0.5)+ geom_point(size = 3)+ylim(c(-18,19))
pcsplot + geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2)
pcsplot + geom_line(colour = "red")
Generate a biplot of a simple dataset where chemical measurements were made on different wines, for which we also have a categorical wine.class variable.
load(here("Book", "data", "wine.RData"))
load(here("Book", "data", "wineClass.RData"))
pheatmap(1 - cor(wine), treeheight_row = 0.2)
winePCAd = dudi.pca(wine, scannf=FALSE)
table(wine.class)
## wine.class
## barolo grignolino barbera
## 59 71 48
fviz_pca_biplot(winePCAd, geom = "point", habillage = wine.class,
col.var = "violet", addEllipses = TRUE, ellipse.level = 0.69) +
ggtitle("") + coord_fixed()
A biplot is a simultaneous representation of both the space of observations and the space of variables.
We want to see the variability between different groups of observations using weighted measurements. Let’s try with the Hiiragi data.
data("x", package = "Hiiragi2013")
xwt = x[, x$genotype == "WT"]
sel = order(rowVars(Biobase::exprs(xwt)), decreasing = TRUE)[1:100]
xwt = xwt[sel, ]
tab = table(xwt$sampleGroup)
tab
##
## E3.25 E3.5 (EPI) E3.5 (PE) E4.5 (EPI) E4.5 (PE)
## 36 11 11 4 4
xwt$weight = 1 / as.numeric(tab[xwt$sampleGroup])
pcaMouse = dudi.pca(as.data.frame(t(Biobase::exprs(xwt))),
row.w = xwt$weight,
center = TRUE, scale = TRUE, nf = 2, scannf = FALSE)
fviz_eig(pcaMouse) + ggtitle("")
fviz_pca_ind(pcaMouse, geom = "point", col.ind = xwt$sampleGroup) +
ggtitle("") + coord_fixed()
► Exercise 7.1
7.1a If a matrix \(X\) has no rows and no columns which are all zeros, then is this decomposition unique?
No, the decomposition is not unique: the factors \(u\) and \(v\) can be rescaled by any nonzero constant \(c\) (replacing them with \(cu\) and \(v/c\)) without changing the product, so the decomposition is only determined up to such rescalings.
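A quick numerical illustration of this non-uniqueness, using small hypothetical vectors u and v: moving a scalar factor between the two vectors leaves the product unchanged.

```r
u = c(1, 2, 3); v = c(4, 5)         # any nonzero vectors
X  = u %*% t(v)                     # one rank-one decomposition
X2 = (2 * u) %*% t(v / 2)           # a different factor pair...
all.equal(X, X2)                    # ...giving exactly the same matrix
```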
7.1b Generate a rank-one matrix: start with a vector of length 15 with values from 2 to 30 in increments of 2, and a vector of length 4 with values 3, 6, 9, 12; then take their outer product.
u = seq(2, 30, by = 2)
v = seq(3, 12, by = 3)
X1 = u %*% t(v)
For the matrix product to be conformable, the number of columns of the left factor must equal the number of rows of the right one; thus the t() function was used to turn \(v\) into a row vector, so that u %*% t(v) is the \(15 \times 4\) outer product.
7.1c Add some noise in the form of a matrix, which we call Materr, so that we obtain an "approximately rank-one" matrix.
Materr = matrix(rnorm(60,1),nrow=15,ncol=4)
X = X1+Materr
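One way to confirm that X is only approximately rank one is to look at the singular values of both matrices: for the exact outer product only the first is nonzero, while after adding noise the first still dominates but the others become small nonzero values (a sketch; the exact numbers depend on the random draw).

```r
svd(X1)$d  # exact rank one: only the first singular value is nonzero
svd(X)$d   # after adding noise: one dominant value, the rest small
```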
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggplot2)
longData <- melt(X)
longData<-longData[longData$value!=0,]
ggplot(longData, aes(x = Var2, y = Var1)) +
geom_raster(aes(fill=value)) +
scale_fill_gradient(low="grey90", high="red") +
labs(x="Columns", y="Rows", title="Matrix") +
theme_bw() + theme(axis.text.x=element_text(size=9, angle=0, vjust=0.3),
axis.text.y=element_text(size=9),
plot.title=element_text(size=11))
ggplot(longData, aes(x = Var2, y = Var1, col = value, fill = value, label = value)) +
geom_tile() +
geom_text(col = "black") +
theme_minimal() +
scale_fill_gradient2(low = "white", mid = "yellow", high = "red") +
scale_color_gradient2(low = "white", mid = "yellow", high = "red")
7.1d Redo the same analyses with a rank-two matrix.
Y = matrix(c(12.5, 35.0, 25.0, 25, 9, 14, 26, 18, 16, 21, 49, 32, 18, 28, 52, 36), ncol = 4, byrow = TRUE)
longData1 <- melt(Y)
longData1 <- longData1[longData1$value != 0, ]
ggplot(longData1, aes(x = Var2, y = Var1)) +
geom_raster(aes(fill=value)) +
scale_fill_gradient(low="grey90", high="red") +
labs(x="Columns", y="Rows", title="Matrix") +
theme_bw() + theme(axis.text.x=element_text(size=9, angle=0, vjust=0.3),
axis.text.y=element_text(size=9),
plot.title=element_text(size=11))
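To confirm that Y really has rank two, inspect its singular values: only the first two should be (numerically) nonzero.

```r
svd(Y)$d     # two substantial values, the rest essentially zero
qr(Y)$rank   # should report 2
```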
► Exercise 7.2 7.2a Create highly correlated bivariate data such as that shown in Figure 7.35.
# Define the means, standard deviations and correlation
mu1 = 1; mu2 = 3.5; s1 = 3.5; s2 = 1.5; rho = 0.9
sigma = matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), 2)
bv_data = data.frame(mvrnorm(50, mu = c(mu1, mu2), Sigma = sigma))
ggplot(data.frame(bv_data),aes(x=X1,y=X2)) +
geom_point()
Check the effective rank of the matrix by looking at its singular value decomposition; here we print the matrix of right singular vectors (the singular values themselves are in the $d component).
(svd(scale(bv_data))$v)
## [,1] [,2]
## [1,] 0.7071068 0.7071068
## [2,] 0.7071068 -0.7071068
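The singular values themselves (the $d component) make the effective rank visible: for highly correlated bivariate data the first value is much larger than the second (the exact values vary with the random draw).

```r
sv = svd(scale(bv_data))$d
sv                 # first value dominates when the two variables are highly correlated
sv^2 / sum(sv^2)   # proportion of the total variance carried by each direction
```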
7.2b Perform a PCA and show the rotated principal component axes.
bv_pca = prcomp(bv_data)
autoplot(bv_pca,loadings = TRUE,
loadings.colour = 'orange',loadings.label = TRUE)
# autoplot is convenient because you do not need to worry about elongation: the percentage of variance explained is indicated on each axis
► Exercise 7.3 Part (A) in Figure 7.35 shows a very elongated plotting region, why?
The plotting region is elongated because coord_fixed() makes the axis lengths proportional to the amount of variance explained by the respective components; PC1 explains much more variance than PC2, so its axis is much longer.
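This can be checked numerically on the bv_pca object fitted above: the ratio of the component standard deviations tells us how much longer the PC1 axis should be than the PC2 axis under coord_fixed().

```r
bv_pca$sdev                       # standard deviations of PC1 and PC2
summary(bv_pca)$importance        # includes the proportion of variance per PC
bv_pca$sdev[1] / bv_pca$sdev[2]   # how much longer the PC1 axis should be drawn
```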
What happens if you do not use the coord_fixed() option and have a square plotting zone? Why can this be misleading?
mu1 = 1; mu2 = 2; s1=2.5; s2=0.8; rho=0.9;
sigma = matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2)
sim2d = data.frame(mvrnorm(50, mu = c(mu1,mu2), Sigma = sigma))
svd(scale(sim2d))$d
## [1] 9.652671 2.196803
svd(scale(sim2d))$v[,1]
## [1] -0.7071068 -0.7071068
ggplot(data.frame(sim2d),aes(x=X1,y=X2)) +
geom_point()
respc=princomp(sim2d)
dfpc = data.frame(pc1=respc$scores[,1],
pc2 = respc$scores[,2])
ggplot(dfpc,aes(x=pc1,y=pc2)) +
geom_point() #+ coord_fixed(2)
Without coord_fixed(), a square plotting zone makes the two components look equally important, so it is difficult to tell what proportion of the variance each component actually explains; the plot visually exaggerates the contribution of PC2.
► Exercise 7.4 7.4a Make a correlation circle for the unweighted Hiiragi data xwt. Which genes have the best projections on the first principal plane (best approximation)?
data("x", package = "Hiiragi2013")
xwt = x[, x$genotype == "WT"]
xwt_pca = prcomp(as.data.frame(t(Biobase::exprs(xwt))))
# use screeplot to view the proportion of variance explained by the various components
screeplot(xwt_pca)
fviz_pca_var(xwt_pca) + ggtitle("") + coord_fixed()
#compare this with the autoplot
autoplot(xwt_pca,loadings = TRUE)
Which genes have the best projections on the first principal plane (best approximation)?
# Genes whose arrows reach closest to the unit circle (largest loadings on PC1 and PC2) are the best approximated in the first principal plane; inspect the first columns of the rotation matrix
head(xwt_pca$rotation[, 1:2])
##                       PC1          PC2
## 1415670_at    0.014979074  0.002785063
## 1415671_at    0.003735398  0.008537833
## 1415672_at   -0.004104042  0.011274405
## 1415673_at    0.008555240 -0.004056765
## 1415674_a_at  0.006632186  0.006343311
## 1415675_at    0.003275449  0.001912670
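One way to answer this quantitatively is via the squared cosines (cos2) that factoextra computes from the PCA object: genes with the highest cos2 summed over the first two components are best represented in the first principal plane (a sketch, assuming the xwt_pca object fitted above).

```r
var_info = factoextra::get_pca_var(xwt_pca)
quality = rowSums(var_info$cos2[, 1:2])  # quality of representation in the 1-2 plane
head(sort(quality, decreasing = TRUE))   # genes with the best projections
```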
7.4b Make a biplot showing the labels of the extreme gene-variables that explain most of the variance in the first plane. Add the sample-points.
fviz_pca_ind(xwt_pca) + ggtitle("") + ylim(c(-2.5,5.7))
## Warning: Removed 61 rows containing missing values (geom_point).
## Warning: Removed 61 rows containing missing values (geom_text).
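A sketch of one way to complete 7.4b: fviz_pca_biplot can overlay the sample points and the gene arrows simultaneously, and its select.var argument restricts labels to the genes contributing most in the first plane (here the 10 most extreme, an arbitrary choice).

```r
fviz_pca_biplot(xwt_pca, geom.ind = "point",
                select.var = list(contrib = 10),  # label only the 10 most extreme genes
                repel = TRUE) +
  ggtitle("") + coord_fixed()
```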